A Technical Report: Entity Extraction using Both Character-based and Token-based Similarity

Authors

  • Zeyi Wen
  • Dong Deng
  • Rui Zhang
  • Kotagiri Ramamohanarao
Abstract

Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on matching sub-string candidates in a document against a dictionary of entities. To handle spelling errors and name variations of entities, the matching is usually approximate, and edit distance or Jaccard distance is used to measure dissimilarity between sub-string candidates and the entities. For approximate entity extraction from free text, existing work considers solely character-based or solely token-based similarity and hence cannot simultaneously deal with minor variations at token level and typos. In this paper, we address this problem by considering both character-based similarity and token-based similarity (i.e. two-level similarity). Measuring one-level (e.g. character-based) similarity is computationally expensive, and measuring two-level similarity is dramatically more expensive. By exploiting the properties of the two-level similarity and the weights of tokens, we develop novel techniques to significantly reduce the number of sub-string candidates that require computation of two-level similarity against the dictionary of entities. A comprehensive experimental study on real-world datasets shows that our algorithm can efficiently extract entities from documents and produce a high F1 score in the range of [0.91, 0.97].
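To make the idea of two-level similarity concrete, the following is a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the exact measure, so the sketch assumes one plausible combination, a token-level Jaccard score in which two tokens count as a match when their normalised edit similarity reaches a threshold. The function names, the greedy token matching, and the default threshold of 0.75 are all illustrative assumptions, and the token weights and candidate pruning described in the abstract are left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def token_similarity(s: str, t: str) -> float:
    """Character-level similarity in [0, 1] (assumed normalisation)."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))


def two_level_similarity(candidate: str, entity: str,
                         token_threshold: float = 0.75) -> float:
    """Token-level Jaccard where tokens match approximately.

    Hypothetical combination for illustration only: each candidate token is
    greedily paired with its most similar unmatched entity token, and the
    pair counts as a match if its character-level similarity is at least
    token_threshold.
    """
    cand_tokens = candidate.lower().split()
    ent_tokens = entity.lower().split()
    if not cand_tokens and not ent_tokens:
        return 1.0
    unmatched = list(ent_tokens)
    matched = 0
    for tok in cand_tokens:
        best = max(unmatched,
                   key=lambda e: token_similarity(tok, e),
                   default=None)
        if best is not None and token_similarity(tok, best) >= token_threshold:
            unmatched.remove(best)
            matched += 1
    union = len(cand_tokens) + len(ent_tokens) - matched
    return matched / union


if __name__ == "__main__":
    # One typo ("Univeristy") plus one missing token ("The").
    print(two_level_similarity("Univeristy of Melbourne",
                               "The University of Melbourne"))
```

Under these assumptions, the example candidate contains both a typo and a token-level variation, the case the abstract says one-level measures cannot handle together. The sketch scores it 0.75, whereas an exact-match token Jaccard would give 2/5 = 0.4 because the misspelled token would not match at all.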

Similar resources

Un Sistema de Extracción de Información Basado en Ontologías para Documentos en el Dominio de las Tecnologías de Información An Ontology-Based Information Extractor for Data-Rich Documents in the Information Technology Domain

This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms combining character-level and token-level similarity measures dealing with non-standardized acronyms and inc...

Token Gazetteer and Character Gazetteer for Named Entity Recognition

Named entity recognition (NER) in information extraction (IE) systems is usually based on large gazetteers, i.e. datasets of well-known and classified entities. NER is also often performed by an independent look-up piece of code, which is considered a bottleneck of many NER systems. In this paper, we present two approaches for building tree gazetteers for NER, namely lookup by token and lookup by character.

Named Entity Recognition in Persian Text using Deep Learning

Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...

Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator

This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...

Journal title:
  • CoRR

Volume abs/1702.03519  Issue 

Pages  -

Publication date 2015